Four "Corgi" model vehicles were used for the experiment: a double-decker bus, a Chevrolet van, and two cars, a Saab 9000 and an Opel Manta 400. This particular combination was chosen with the expectation that the bus, the van, and either one of the cars would be readily distinguishable, but that the two cars would be more difficult to tell apart.
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
import seaborn as sns
from scipy.stats import zscore
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
df=pd.read_csv('vehicle.csv')
df.head()
df.isnull().sum()
df.dropna(axis=0, how='any', inplace=True)
df.isnull().sum()
Since the class variable is categorical, you can inspect its distribution with the value_counts function
df['class'].value_counts()
sns.countplot(x='class', data=df)
Since the features are on very different scales, it is wise to standardize the data using z-scores before applying any clustering method. You can use the zscore function from scipy.stats to do this
#Scaling
from scipy.stats import zscore
df.info()
df.columns
df1 = df.copy()
df1.head()
cols = ['compactness', 'circularity', 'distance_circularity', 'radius_ratio',
'pr.axis_aspect_ratio', 'max.length_aspect_ratio', 'scatter_ratio',
'elongatedness', 'pr.axis_rectangularity', 'max.length_rectangularity',
'scaled_variance', 'scaled_variance.1', 'scaled_radius_of_gyration',
'scaled_radius_of_gyration.1', 'skewness_about', 'skewness_about.1',
'skewness_about.2', 'hollows_ratio']
df1 = df[cols].astype(np.int64)
df1.head()
df.head()
df1 = df1.apply(zscore)
df1.head()
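As a quick sanity check on the scaling (a sketch on a toy frame, independent of the vehicle data): after `apply(zscore)` every column should have mean 0 and population standard deviation 1, since `scipy.stats.zscore` uses `ddof=0` by default:

```python
import numpy as np
import pandas as pd
from scipy.stats import zscore

# Hypothetical two-column frame standing in for the vehicle features
toy = pd.DataFrame({'a': [1.0, 2.0, 3.0, 4.0],
                    'b': [10.0, 20.0, 30.0, 40.0]})
scaled = toy.apply(zscore)

print(scaled.mean())       # each column centred at 0
print(scaled.std(ddof=0))  # population standard deviation of 1
```

The same check can be run on `df1` after scaling; note that pandas' `std` defaults to `ddof=1`, so pass `ddof=0` to match zscore's convention.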
import seaborn as sns
sns.pairplot(df,diag_kind='kde')
sns.pairplot(df,diag_kind='kde', hue='class')
Store the sum-of-squares error for each value of k in an array; you can later use this array to plot the elbow plot
model = KMeans(n_clusters = 3)
model
Iterate values of k from 1 to 14, fit a K-Means model for each, and record the sum-of-squares error of each fit.
Here, K-Means attempts to minimize distortion, defined as the sum of the squared distances between each observation and its closest centroid.
cluster_range = range( 1, 15 )
cluster_errors = []
for num_clusters in cluster_range:
    clusters = KMeans(n_clusters=num_clusters, n_init=10)
    clusters.fit(df1)
    # labels = clusters.labels_
    # centroids = clusters.cluster_centers_
    cluster_errors.append(clusters.inertia_)
clusters_df = pd.DataFrame( { "num_clusters":cluster_range, "cluster_errors": cluster_errors } )
clusters_df[0:15]
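To make the distortion definition concrete, the inertia reported by scikit-learn can be recomputed by hand. This is a sketch on synthetic data (so it runs independently of the vehicle frame):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))  # synthetic stand-in for the scaled features

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Recompute the distortion by hand: squared distance from each point
# to its assigned centroid, summed over all points
manual = ((X - km.cluster_centers_[km.labels_]) ** 2).sum()
print(np.isclose(manual, km.inertia_))  # True
```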
Use Matplotlib to draw the scree plot. Note: the scree plot shows distortion versus the number of clusters
# Elbow plot
plt.figure(figsize=(12,6))
plt.plot(clusters_df.num_clusters, clusters_df.cluster_errors, marker="o")
plt.xlabel('Number of clusters')
plt.ylabel('Sum of squared errors')
From the elbow in the plot, we can take the value of k as 3
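As a cross-check on this choice of k, the silhouette score can be computed for several candidate values (higher is better). The sketch below uses synthetic, well-separated blobs so it runs independently of the vehicle data; in the notebook the same check would be `silhouette_score(df1, kmeans.labels_)`:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in: three well-separated blobs
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [10, 10], [0, 10]],
                  cluster_std=1.0, random_state=0)

scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k)  # 3 for this well-separated synthetic set
```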
kmeans = KMeans(n_clusters=3, n_init = 15, random_state=2345)
Note: Since the data has more than two dimensions, we cannot visualize it directly. As an alternative, we can inspect the centroids and note how they are distributed across the different dimensions
kmeans.fit(df1)
Hint: Use the pd.DataFrame function
centroids = kmeans.cluster_centers_
centroids
centroid_df = pd.DataFrame(centroids, columns = list(df1) )
centroid_df
## Create a new dataframe for the labels and convert them to a categorical variable
df_labels = pd.DataFrame(kmeans.labels_ , columns = list(['labels']))
df_labels['labels'] = df_labels['labels'].astype('category')
# Joining the label dataframe with the vehicle data frame to create df_labeled. Note: it could also be appended to the original dataframe
df_labeled = df1.join(df_labels)
df_analysis = df_labeled.groupby('labels').head(len(df_labeled))
# groupby creates a grouped dataframe; calling .head() with the full row
# count converts it back to an ordinary dataframe
df_analysis
df_labeled['labels'].value_counts()
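Because the true `class` column is still available (it was not used for fitting), a crosstab of cluster labels against true classes shows how well the clusters recover the vehicle types. The snippet below sketches the idea on hypothetical labels; in the notebook it would be `pd.crosstab(df['class'], kmeans.labels_)`:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the true classes and the K-Means cluster labels
true_class = np.array(['bus', 'bus', 'van', 'van', 'car', 'car'])
cluster = np.array([0, 0, 1, 1, 2, 1])

# Rows: true class; columns: how its members were distributed over clusters
ct = pd.crosstab(true_class, cluster)
print(ct)
```

A clustering that separates the vehicle types well shows one dominant column per row; per the experiment's design, the two car models are the rows most likely to share a cluster.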
from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(projection='3d')
ax.view_init(elev=20, azim=100)
kmeans.fit(df1)
labels = kmeans.labels_
# Plot the first three (scaled) features, coloured by cluster label
ax.scatter(df1.iloc[:, 0], df1.iloc[:, 1], df1.iloc[:, 2], c=labels.astype(float), edgecolor='k')
ax.xaxis.set_ticklabels([])
ax.yaxis.set_ticklabels([])
ax.zaxis.set_ticklabels([])
ax.set_xlabel('compactness')
ax.set_ylabel('circularity')
ax.set_zlabel('distance_circularity')
ax.set_title('3D plot of KMeans Clustering')
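An alternative to picking three raw features for the 3-D plot (not part of the original notebook) is to project all 18 dimensions onto the first two principal components and colour the points by cluster label. A sketch on synthetic data standing in for the scaled vehicle features:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X = rng.normal(size=(120, 18))  # synthetic stand-in for the 18 scaled features

km = KMeans(n_clusters=3, n_init=10, random_state=2345).fit(X)

# Project the 18-D points onto the first two principal components
pts = PCA(n_components=2).fit_transform(X)
print(pts.shape)  # (120, 2)

plt.scatter(pts[:, 0], pts[:, 1], c=km.labels_, edgecolor='k')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title('K-Means clusters in PCA space')
```

In the notebook, replacing `X` with `df1` gives a single 2-D scatter that uses information from every feature rather than just the first three columns.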